Air pollution is currently one of the most serious public health worries worldwide. The data is contained in the file ozone.txt.
library(FactoMineR)
library(reshape2)
library(ggplot2)
library(png)
library(EBImage)
library(ripa)
## Loading required package: tcltk
## Loading required package: parallel
##
## Attaching package: 'ripa'
## The following object is masked from 'package:EBImage':
##
## normalize
Ozone <- read.table("ozone.txt",header = TRUE, sep = " ", row.names = 1)
res.pca <- PCA(Ozone , quanti.sup =1,quali.sup = 12:13,graph=FALSE,ncp=15)
plot(res.pca, choix = "var", axes = c(1, 2),label = "quanti.sup")
plot(res.pca, choix = "ind", habillage = ncol(Ozone), cex = 0.8,label = "quali")
plot(res.pca, choix = "var", axes = c(1, 2),label = "var",col.var="brown")
We can separate our variables into 3 high within-correlation sets:
The end points of the projected vectors are very close to the circle. It means the norm of the vectors hasn’t been affected that much by the projection. Overall, the PCA preserved the variance of the Ozone data.
res.pca$eig[1:2,]
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 5.494434 54.94434 54.94434
## comp 2 1.844963 18.44963 73.39397
Dim 1 stands for 54.94% of the variance distribution.
Dim 2 stands for 18,45% of the variance distribution.
Overall, 73.39% of the variance distribution is preserved, which is satisfying.
PCA enables us to graphically approximate the variables correlation. Since regression is only meaningful for correlated variables, we can perform PCA before performing regression.
Besides we can evaluate beforehand what the general tendendy of the regression will be. For instance, maxO3 is positively correlated with T15, but negatively correlated with Ne15.
With the PCA outputs, one could reconstruct the data. First, reconstruct the data with one dimension. You should be aware that with the results of PCA, you could reconstruct the centered and scaled data, so that you may need to add the means and multiply by the standard deviations to get the approximation of the original matrix.
Here we manually reconstruct
# Data on the first dimension
ind_proj<-(matrix(res.pca$ind$coord[,1]))
# Variable on the first dimension
var_proj<-t(matrix(res.pca$var$coord[,1]))
Ozone_reconstruct <- ind_proj %*% var_proj
Ozone_r <- data.frame(Ozone_reconstruct)
colnames(Ozone_r) <- names(Ozone)[2:11]
numO <- Ozone[,2:11]
moyO <- apply(Ozone[,2:11],2,mean)
sdO <- apply(Ozone[,2:11],2,sd)
Ozone_r <- sweep(Ozone_r,2,sdO,FUN="*")
Ozone_r <- sweep(Ozone_r,2,moyO,FUN="+")
plot(Ozone[,"maxO3y"])
plot(Ozone_r[,"maxO3y"])
Calling Xˆ the reconstructed matrix, observe the difference between X and Xˆ by plotting the variable maxO3y for the two matrices in function of the observations. Comment.
Ozone_m <- data.frame(X=1:nrow(Ozone), max03 =Ozone[,"maxO3y"], max03_r = Ozone_r[,"maxO3y"])
Ozone_m <- melt(Ozone_m, id.vars = "X")
ggplot(Ozone_m, aes(x = X, y = value, color = variable)) +theme_bw() +geom_line()
The variance of the maxO3 variable has mostly been preserved.
We then observe maxO3y = f(Wx15) after projection…
ggplot(Ozone_r, aes(x = Wx15, y = maxO3y))+geom_line()
The linear relation obtained here is not surprising. (We projected on a one-dimensionnal space).
Below is the original relation between those two variables.
ggplot(Ozone, aes(x = Wx15, y = maxO3y))+geom_line()+geom_smooth()
Note : I tried to write a reconstruct function by hand, by it seems it doesn’t fit with the reconst function provided with the PCA
reconstruct <- function(d){
# Data on the first dimension
ind_proj<-(as.matrix(res.pca$ind$coord[,1:d]))
# Variable on the first dimension
var_proj<-t(as.matrix(res.pca$var$coord[,1:d]))
Ozone_tmp <- ind_proj %*% var_proj
Ozone_tmp <- data.frame(Ozone_tmp)
Ozone_tmp <- sweep(Ozone_tmp,2,sdO,FUN="*")
Ozone_tmp <- sweep(Ozone_tmp,2,moyO,FUN="+")
return(Ozone_tmp)
}
We will represent maxO3y, for the original and the reconstructed data (observed and estimated).
represent <- function(Ozone_r){
Ozone_m <- data.frame(X=1:nrow(Ozone), max03y = Ozone[,"maxO3y"], max03y_r = Ozone_r[,"maxO3y"])
Ozone_m <- melt(Ozone_m, id.vars = "X")
ggplot(Ozone_m, aes(x = X, y = value, color = variable)) +theme_bw() +geom_line()
}
represent(reconst(res.pca,ncp=2))
represent(reconstruct(2))
represent(reconst(res.pca,ncp=3))
Already very close to the original data.
represent(reconst(res.pca,ncp=5))
Perfectly identical to the original data.
Ozone_tmp <- reconst(res.pca,ncp=10)
# The data is already renormalized thanks to reconst
# Otherwise, we could have done :
# Ozone_tmp <- sweep(data.frame(Ozone_tmp),2,sdO,FUN="*")
# Ozone_tmp <- sweep(Ozone_tmp,2,moyO,FUN="+")
represent(Ozone_tmp)
If we wanted to have the coordinate of max03, along the PCA components
res.pca$quanti.sup$coord[,1]
## [1] 0.8215545
Import the data missing.jpg or breton.png and reconstruct the image with 1 to 100 dimensions. There are different packages dedicated to images. With the following lines of code, you could read an image, reprensent it and then transform it in a grey scale. (Be careful that you may need to execute the code in the R console and not by knitting).
img <- readImage("missing.jpg")
display(img, method = "raster")
r <- imagematrix(img, type = "grey")
display(r, method = "raster")
Using the grey image, reconstruct the image with 1 to 100 dimensions (with PCA and reconst function of FactoMineR). You should use the packages “jpeg” and “png” to export the image with functions
Let’s display for these numbers of PCA components: 10,40,70,100
img.pca = PCA(r, graph=FALSE,ncp = 100)
# 100 is the max number of principal components computed
# reconstr(1:10)[2,]==reconstr(1)[2,]
for (i in seq(10,100,30)){
display(reconst(img.pca,ncp=i),method = "raster")
}
The folder reconstruct will hold all the 100 images.
dir.create("reconstruct")
## Warning in dir.create("reconstruct"): 'reconstruct' already exists
for (i in seq(1,100,1)){
png(filename=paste("reconstruct/",toString(i),".png",sep=""))
display(reconst(img.pca,ncp=i),method = "raster")
dev.off()
}
We already get the overall shape of the picture using about 10 components from PCA.
We can clearly read the picture using 40 components.
Of course, the more elements, the clearer the picture, and we need approximately 100 components to distinguish the word “FOUNDATION” from the poster.